Method and apparatus for correlation of events in a distributed multi-system computing environment

ABSTRACT

A method and system is disclosed for monitoring an operation of a distributed data processing system. The system can include a plurality of applications running on a plurality of host processors and communicating with one another, such as through a message passing technique. The method includes steps executed in individual ones of the plurality of applications, of (a) examining individual ones of generated API calls to determine if a particular API call meets predetermined API call criteria; (b) if a particular API call meets the predetermined API call criteria, storing all or a portion of the content of the API call as a stored event; (c) processing a plurality of the stored events to identify logically correlated events, such as those associated with a business transaction; and (d) displaying all or a portion of the stored API call content data for the logically correlated events.

FIELD OF THE INVENTION

The invention relates generally to methods and apparatus for correlating events attributable to computer programs residing on different computer systems in a distributed network, and more particularly relates to techniques and systems for tracing problem events to their source and facilitating their resolution.

BACKGROUND OF THE INVENTION

As the complexity of computer systems and networks of computer systems increases, it becomes more complex and time consuming to trace and resolve problems. This is especially true in large distributed systems where multiple computer programs are concurrently running in multiple computer systems.

Typically, experienced software developers are used to monitor each of these systems and combine the individual analyses in order to obtain a coherent, global view of the operation of the distributed data processing system.

In accordance with current methodologies this is a very manual and labor intensive process, and requires unique skills in the various computer operating environments that make up the distributed system. Furthermore, the inputs to the analysis, such as event and message tracing data, are not in common formats across the various systems. These factors combine to make it a very tedious, error prone, slow and costly process to attempt to correlate these various disparate data traces into a coherent model of the operation of the distributed data processing system.

Furthermore, the traditional error diagnosis processes typically employ a debugger, which is intrusive, or an embedded error logging facility, which normally requires that source code modifications be made.

The deficiencies of the prior art approach to problem identification and resolution have become more prominent as large scale distributed business enterprise systems have been developed, wherein a plurality of different applications running on different hosts and under different operating systems all cooperate via message passing techniques to process input data related to independent and asynchronous transactions. A type of management software known as “middleware” has been developed to control and manage the message flow and processing, and employs message queues to temporally isolate the various applications from one another. In such a system several thousand transactions may be simultaneously in process, resulting in corresponding thousands of Application Program Interface (API) calls and messages being concurrently generated and routed through the system.

As can be appreciated, identifying a cause of a failure or error condition occurring in one or a few of these transactions can be very complex, time consuming and, because of the significant amount of human operator analysis required, error prone.

OBJECTS AND ADVANTAGES OF THE INVENTION

It is a first object and advantage of this invention to provide a method and system for providing logical diagnostic information for events, such as API calls, call arguments and return values, for a distributed data processing system wherein transactions occur over a plurality of hosts and applications.

It is another object and advantage of this invention to provide a method and system for sensing and capturing, in a distributed manner, an occurrence of events including API calls, call arguments and return values, for automatically correlating captured events relating to a particular distributed transaction, and for displaying the correlated events to a human operator in a logically consistent manner.

SUMMARY OF THE INVENTION

The foregoing and other problems are overcome and the foregoing objects and advantages are realized by methods and apparatus in accordance with embodiments of this invention.

The teachings of this invention solve the above-mentioned problems by providing a uniform framework for capturing, managing, and correlating events from heterogeneous environments. In a presently preferred, but not limiting, embodiment the teachings of this invention support the automatic correlation of IBM™ MQSeries™ (IBM and MQSeries are trademarks of the International Business Machines Corporation) API events, as well as a human user-assisted correlation of similar events, through an event modelling scheme and user management interface.

More specifically, this invention provides the following novel processes, systems and sub-systems.

In a first aspect this invention provides a design and implementation of an infrastructure for intercepting function calls, such as API calls, and generating events representing the corresponding function calls from different computer programs in a distributed computing environment. This process is conducted in a non-intrusive manner. The infrastructure supports the conditional collection of a subset of event data through a data collection filter mechanism.

In a second aspect this invention provides a set of data structures for modeling function calls and data structures, software programs, and miscellaneous computer system resources (e.g., IBM™ MQSeries™ queue managers) of heterogeneous technologies. These data structures expose the event internals through a uniform set of interfaces.

In a third aspect this invention provides for the development and realization of the concept of event relations for modeling a message path relation between a send and receive event, which is an important element in an event correlation algorithm. An algorithm for the systematic examination of events and the generation of corresponding event relations is also provided.

In a fourth aspect this invention provides an interface built on top of an internal event model for exposing internal details of collected events through, for example, Microsoft COM object models.

In a fifth aspect this invention provides an algorithm for the automatic correlation of IBM™ MQSeries™ events from different software programs that are involved in the same local and/or business transactions.

In a further aspect this invention provides a mechanism to allow a human user to select a subset of collected events according to a set of evaluation criteria based on the event internal data. The user can achieve this selection through the use of a scripting language, such as Microsoft Visual Basic™ scripts, and a human interface.

These various aspects of the invention provide a unique perspective to manage the collection and correlation of events in a distributed computing environment in the following manner.

First, event collection is handled in a non-intrusive manner. That is, no additional work (source code modification, recompilation, linking, etc.) is needed on the monitored software programs for event generation. Moreover, a human user need not have any knowledge of the internals of the software programs that he/she is monitoring. This contrasts favorably with the traditional diagnosis process, including those that use the debugger (intrusive) or the embedded logging (through source code modifications) approaches.

Second, event collection can be triggered by the fulfillment of a set of criteria based on, for example, software program running states and computing environments. In other words, event collection is in general “disabled” for avoiding any interruption of normal program execution, and then automatically enabled for responding to an error condition or a change in program states or environments. When enabled by the triggering event(s), the sensor can send all event data that satisfies a specific data collection filter.

Third, an amount of data to be collected from the software programs can be decided both statically (through pre-programmed filtering conditions) and dynamically (such as from certain environment and program states).

Fourth, the human user can control the monitoring activities in a distributed computing environment from one central console.

Fifth, event correlations for transaction analysis can be accomplished using an automatic correlation mechanism, thereby eliminating or reducing the involvement of highly skilled software programmers.

Sixth, a user interface is provided for enabling a human user or operator to visualize and analyze subset(s) of events selected by user-defined selection criteria. In the presently preferred embodiment these selection criteria are defined through the use of Microsoft Visual Basic™ scripts. The operator has the ability to modify and customize the scripts to tailor the presentation to a desired format and content. The script may also be automatically generated by entry of data into a few fields in a presentation filter dialogue box.

A method and system is therefore disclosed for monitoring an operation of a distributed data processing system. The system is a type of system that includes a plurality of applications running on a plurality of host processors and communicating with one another, such as through a message passing technique. The method has steps executed in the plurality of applications for: (a) examining individual ones of generated Application Program Interface (API) calls to determine if a particular API call meets predetermined API call criteria; (b) if a particular API call meets the predetermined API call criteria, storing all or a portion of the content of the API call as a stored event; (c) processing a plurality of the stored events to identify logically correlated events, such as those associated with a business transaction; and (d) displaying all or a portion of the stored API call content data for the logically correlated events. The API call criteria can include, by example, system entity identity, the API name, timing data and/or restrictions on parameter values to the API call. The step of displaying preferably includes a step of processing the stored API call content data for the logically correlated events using a script (pre-programmed, automatically generated, or operator-defined). The step of examining includes initial steps of: installing a sensor between an output of the application and a function call library for emulating, relative to the application, the interface to the function call library; and storing the predetermined API call criteria in a memory that is accessible by the sensor. The step of examining then further includes steps of intercepting with the sensor an API call output from the application; determining if the intercepted API call fulfills the stored predetermined API call criteria; and, if a match occurs, capturing data representing all or a portion of the content of the API call and transmitting the captured data to a database for storage as the stored event.

BRIEF DESCRIPTION OF THE DRAWINGS

The above set forth and other features of the invention are made more apparent in the ensuing Detailed Description of the Invention when read in conjunction with the attached Drawings, wherein:

FIG. 1 is a block diagram illustrating an exemplary monitoring environment in accordance with the teachings herein;

FIGS. 2–10 are each a logic flow diagram or a logic model, wherein

FIG. 2 depicts sensor work flow;

FIG. 3 depicts an analyzer data logic model;

FIG. 4 depicts an analyzer logic model;

FIG. 5 depicts analyzer new event handling work flow;

FIG. 6 depicts analyzer event relation generation flow;

FIG. 7 illustrates a COM model interface;

FIG. 8 illustrates a presentation data filtering operation;

FIG. 9 illustrates a first embodiment of transaction correlation;

FIG. 10 illustrates a second embodiment of transaction correlation;

FIG. 11 is a table that illustrates a number of exemplary standard event attributes, and is referenced below in the description of the data model of FIG. 3;

FIG. 12 is a simplified block diagram illustrating a relationship between a sensor, an application, and a call library emulated by the sensor;

FIG. 13 is a block diagram of an exemplary distributed enterprise middleware-based system that includes the analyzer and related components in accordance with the teachings herein;

FIG. 14 is a conceptual block diagram of the analyzer console and its interface with sensors;

FIG. 15 shows an exemplary content of a log file used to record message traffic after a tracing facility is enabled;

FIG. 16 is an exemplary dynamic transaction visualization of message flow and API calls in the distributed enterprise middleware-based system of FIG. 13; and

FIG. 17 illustrates how the captured event data can be visualized in an event details mode.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates an exemplary analyzer monitoring environment. An analyzer system 10 in accordance with the teachings herein comprises two major sub-systems: an analyzer 12 (also referred to herein as an analyzer console) and a plurality of sensors 14. The sensors 14 may be considered as agents that reside in the space of a monitored process, and operate to collect information on calls of the particular technology that a particular sensor 14 is monitoring.

Referring briefly to FIG. 12, for Microsoft and UNIX™ Platforms (UNIX is a trademark of X/Open Company, Limited) a sensor 14 library 14B implements all of the API entry points for the technology that the particular sensor 14 monitors. The sensor library 14B is named exactly as a standard call library 13, and is installed in a manner such that any monitored process or application 16 will interface at runtime with the sensor library 14B, instead of the standard library 13. This process is conducted in a non-intrusive manner and does not require any additional recompilation or relinking of the user application.

For an OS/390™ platform (OS/390 is a trademark of the International Business Machines Corporation), in particular for the MQSeries™, a different approach makes use of the crossing exit mechanism provided by CICS™ (CICS is a trademark of the International Business Machines Corporation). This approach also maintains the non-intrusive manner of the sensor 14 injection process.

Referring also to FIG. 1, during the execution of the user application 16, control is passed via path 101 to the associated sensor 14 whenever a monitored API is invoked. In response, the sensor 14 performs the necessary work to generate an event representing the API call state. The generation of the event is triggered by the API fulfilling requirements stored in a sensor configuration filter 14A (FIG. 12), which is programmed with configuration commands or messages by the analyzer 10.

A human operator employs the analyzer console 12, also referred to as the analyzer user interface (UI), for controlling the activities of the sensors 14, for visualizing the collected event data, and for performing data analysis. The analyzer console 12 sends out the sensor 14 configuration messages through a MQSeries™-based asynchronous communication network 15. This process is illustrated by path 104 (analyzer to Queue Manager/Queue 18) and path 102 (Queue Manager/Queue 18 to sensor 14) in FIG. 1. The sensor 14 also makes use of the same communication network 15 to pass captured event(s) to the analyzer console 12 via paths 103 and 105. The collected events are stored in a local event database 20 associated with the analyzer 12, via paths 106 and 107.

FIG. 2 illustrates the control flow of the sensor 14. At step 210 an application 16 makes a function call belonging to the set of functions monitored by the associated sensor 14. In the preferred embodiment, at step 212, a tricoder function is invoked instead of the standard function. A tricoder function yields program control to the sensor 14 via path 201 for analyzer 10 related processing.

In step 214, the sensor 14 first manages the configuration database 14A, also referred to herein as a configuration queue, in the analyzer communication network 15. This management function includes examining received configuration messages on the configuration queue, removing expired messages, and retrieving newly arrived messages. At step 216 the sensor 14 examines each of the newly arrived messages retrieved from step 214 and updates the internal data structures. Each configuration message contains a set of data collection filter rules. These rules determine the conditions which trigger event generation/reporting, as well as an amount of information to be collected from the event data packet. The filter rule conditions are preferably based on system entity identity (e.g., software program name, host machine name, queue manager name, etc.), API name, timing information, and/or restrictions on parameter values to the API call, as described in further detail below.

At step 218 the sensor 14 determines if any of the existing filter rules match the current program state. If there is a match, the sensor 14 generates the event, thereby capturing the state of the triggering function call (step 220). Whether or not an event was generated, at step 222 the sensor 14 invokes the standard API, so that the application 16 still receives the service it requested. The sensor 14 subsequently returns control to the application 16.

The amount of information contained in the generated event depends on the filter rule specification. The filter rule specification determines whether function call parameters are to be sent, and the range of user data to be carried along with the event packet. For example, a particular packet may include some thousands of bytes of user message, and the filter rule specification may cause only the first 16 bytes to be captured and stored as part of the event, or may specify that none of the user message data be captured and saved. The filter rule specification(s) thus controls the type and amount of data that is captured and stored upon the filter rule matching the current program state.
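The effect of a filter rule on the captured data can be sketched as follows. This is a minimal illustration only: the `FilterRule` class, its field names, and the `build_event` helper are hypothetical and are not taken from the described sensor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterRule:
    """Hypothetical data collection filter rule (field names are illustrative)."""
    program_name: Optional[str] = None   # None means "match any program"
    api_name: Optional[str] = None       # None means "match any API"
    capture_params: bool = True          # whether call parameters are reported
    max_user_bytes: int = 0              # 0 means capture no user data

    def matches(self, program: str, api: str) -> bool:
        return (self.program_name in (None, program)
                and self.api_name in (None, api))


def build_event(rule: FilterRule, program: str, api: str,
                params: dict, user_data: bytes) -> dict:
    """Assemble an event packet limited to what the rule allows."""
    return {
        "program": program,
        "api": api,
        "params": dict(params) if rule.capture_params else {},
        "user_data": user_data[:rule.max_user_bytes],
    }


rule = FilterRule(api_name="MQPUT", max_user_bytes=16)
message = bytes(range(256)) * 8          # a 2048-byte user message
event = build_event(rule, "orders", "MQPUT", {"queue": "Q1"}, message)
```

With `max_user_bytes=16` only the first 16 bytes of the 2048-byte message are kept, and with the default of 0 no user data is kept at all.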

In some cases the amount of captured data may be made dynamic, e.g., as a function of the current environment or operating state of the system/processor being monitored.

It is also possible to repeat steps 218 and 220 after the standard API call returns control to the sensor 14, in order to generate an event representing the post-call state. This recursion is indicated by the dashed line 226.
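The sensor control flow of FIG. 2 can be approximated by a wrapper that stands in for the standard function. The sketch below is an assumption-laden illustration, not the sensor library itself: `make_sensor`, `put_message`, and the event dictionary layout are invented for the example, and it assumes the standard call is made whether or not a rule matches, as the post-call pass implies.

```python
from typing import Any, Callable, List


def make_sensor(standard_api: Callable[..., Any],
                rule_matches: Callable[[str, tuple], bool],
                events: List[dict]) -> Callable[..., Any]:
    """Return a stand-in that records events around the standard call."""
    name = standard_api.__name__

    def sensor_entry_point(*args: Any) -> Any:
        if rule_matches(name, args):                       # step 218
            events.append({"api": name, "phase": "entry", "args": args})  # step 220
        result = standard_api(*args)                       # step 222
        if rule_matches(name, args):                       # optional post-call pass
            events.append({"api": name, "phase": "exit", "result": result})
        return result                                      # back to the application

    return sensor_entry_point


def put_message(queue: str, body: str) -> int:
    """Hypothetical stand-in for a function in the standard call library."""
    return len(body)


captured: List[dict] = []
monitored_put = make_sensor(put_message,
                            lambda api, args: args[0] == "ORDERS",
                            captured)
monitored_put("ORDERS", "hello")    # matches the rule: entry and exit events
monitored_put("AUDIT", "ignored")   # no match: standard call only
```

The application sees the same return value either way; only the side effect of event generation differs.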

FIG. 3 illustrates the data model used by the analyzer system 10 to store and represent the function call states and monitored environment in a hierarchical/networked manner. The program 310, host 312, and program instance 314 data types represent the system entities in a monitored environment, where an entity is any object in the monitored system that exists for a certain length of time. Note that a program instance 314 is always associated with a program 310 and a host machine 312. The program instance 314 can be considered as a process and thread of execution in a UNIX™/Microsoft Windows™ environment (Windows is a trademark of the Microsoft Corporation), and as a region-transaction-task in the OS/390™ CICS™ environment.

A resource 316 is an entity that is specific to a particular technology monitored by the analyzer 10. For example, for the MQSeries™, the queue manager and the queues are considered to be a resource 316. One type of resource 316 can be associated with another (e.g.: Queue Manager and the associated Queue, shown collectively as 18 in FIG. 1).

An event entry represents the captured state of a function call collected by one of the sensors 14 in the system 10. That is, it is the internal storage for the event packets collected from different sensors 14. An event entry is associated with a program instance and optionally one or more resources. The event data can be divided into two groups: standard or technology neutral event information 318 and technology specific event information 320. The former includes information that is common among different technologies. FIG. 11 is a table that illustrates a number of exemplary standard event attributes. It should be noted that the entity origin information including host name, program name, program instance identifier, and resource name (level 1 and level 2) can be accessed through the entity and resource entries associated with the respective event entry.

The technology specific event information 320 contains function call parameters and a user data buffer. User data refers to the information particular to the application 16, and not the technology and function set. The technology specific event information 320 is divided into two sections: one covers the data captured before the standard function call (entry data), and one covers the data captured after the standard function call (exit data).
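The relationships among the entities of FIG. 3 can be sketched with a few record types. The class and field names below are illustrative assumptions; only the associations (an instance tied to a program and a host, an event tied to an instance and optional resources, entry and exit data kept apart) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Program:            # program 310
    name: str


@dataclass(frozen=True)
class Host:               # host 312
    name: str


@dataclass
class ProgramInstance:    # program instance 314: always tied to a program and a host
    program: Program
    host: Host
    instance_id: str


@dataclass
class Resource:           # resource 316, e.g. a queue associated with a queue manager
    name: str
    parent: Optional["Resource"] = None


@dataclass
class EventEntry:
    instance: ProgramInstance
    standard: Dict[str, object]      # technology-neutral information 318
    entry_data: bytes = b""          # technology-specific 320, before the call
    exit_data: bytes = b""           # technology-specific 320, after the call
    resources: List[Resource] = field(default_factory=list)

    def origin(self) -> Tuple[str, str, str, List[str]]:
        """Origin information is reached through the linked entries, not copied."""
        return (self.instance.host.name,
                self.instance.program.name,
                self.instance.instance_id,
                [r.name for r in self.resources])


qmgr = Resource("QM1")
queue = Resource("ORDERS.IN", parent=qmgr)
instance = ProgramInstance(Program("orders"), Host("hostA"), "pid-4711/tid-1")
event = EventEntry(instance, {"api": "MQPUT"}, resources=[qmgr, queue])
```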

Each event entry is associated with a group of event relationships 322. There can be different types of relationships defined for events. One important type of relationship considered by the analyzer 10 of this invention is the message path relation. The message path relation associates events that serve as the source and destination of a message transaction between two entities in the monitored system. The concept of message path relation is generic for different technologies, and is realized by a specific relationship type for each technology monitored by the analyzer 10. As an example, for the MQSeries™ it is realized by the MQPUT-MQGET type relation that associates MQPUT/MQPUT1 and MQGET calls dealing with the same message. In general, an MQPUT call puts data on a queue, while the MQGET call takes data from a queue.

A lookup table 324, similar to a hash table, is used for storing key-value mapping. Each entry in the lookup table 324 contains at least a technology name, a key type, a key value, and value list. The value list contains a set of events that bear the same key value. For the MQSeries™ example, the key type is based on a combination of Message ID, Correlation ID, and Message Time. This allows the analyzer 10 to group MQPUT/MQPUT1/MQGET events bearing the same message ID, correlation ID, and message time, and to then look up the event in an efficient manner. This is particularly useful for deriving a message path relation.
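A lookup table of this kind maps naturally onto a dictionary of lists. The sketch below is one possible rendering under stated assumptions: the event dictionaries, the key-type label, and the `index_event` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# (technology name, key type, key value); the key value is itself a tuple
Key = Tuple[str, str, Tuple[bytes, bytes, str]]

lookup_table: DefaultDict[Key, List[Dict]] = defaultdict(list)


def key_for(event: Dict) -> Key:
    return ("MQSeries", "MsgId+CorrelId+PutTime",
            (event["msg_id"], event["correl_id"], event["put_time"]))


def index_event(event: Dict) -> List[Dict]:
    """Add the event to its value list; return the events already stored there."""
    value_list = lookup_table[key_for(event)]
    earlier = list(value_list)
    value_list.append(event)
    return earlier


put = {"api": "MQPUT", "msg_id": b"M1", "correl_id": b"C1", "put_time": "12300000"}
get = {"api": "MQGET", "msg_id": b"M1", "correl_id": b"C1", "put_time": "12300000"}

first = index_event(put)    # nothing stored under this key yet
second = index_event(get)   # the earlier MQPUT is a candidate partner
```

Because both events share one key, the MQGET finds the MQPUT with a single dictionary access rather than a scan of all stored events.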

FIG. 4 illustrates the logic model 718 (see FIG. 7) defined for the analyzer 10. Recalling first that event data can be divided into a standard and technology specific section, the data format for the technology-specific section is different for different technologies. The analyzer 10 logic model provides a uniform way for exposing the technology specific data to different components of the analyzer 10.

As was indicated previously, the technology-specific event data section in the data model covers the function call parameters and the user data buffer. Call parameters bear different data types specific to the corresponding technology. Moreover, it is possible that the user data buffer may have embedded structures of technology-specific data types. The analyzer logic model 718 is comprised of a Method/Function 410 and an analyzer data type 412.

The analyzer logic model 718 defines a class BCMethod for representing any API or class methods. BCMethod objects store the call parameter names and corresponding analyzer logic model data type (described below).

The analyzer logic model 718 also defines the base class BCType for representing any technology-specific data types. A BCType (or derived class) object contains one or more display string generators 414 and a data locator 416.

A given one of the display string generators 414 contains functions for producing a string formatted in a particular way for display purposes. It is defined by a display format string and the logic for generating such a string. The data locator 416 aids in determining the exact location of the runtime data for a particular call parameter and type in the technology specific event data section. By combining the data locator 416 and the runtime event data, the analyzer 10 is enabled to access any call parameter value in an event record. The display string generator 414 associated with the BCType object can then make use of this data pointer and produce the string representing the parameter value.
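The division of labor between the data locator and the display string generator can be illustrated as below. The classes are simplified stand-ins, the byte layout of `raw_event` is invented for the example, and the use of an offset plus a `struct` format code as the locator is an assumption rather than the described implementation.

```python
import struct
from dataclasses import dataclass


@dataclass
class DataLocator:
    """Where a parameter lives in the raw technology-specific event data."""
    offset: int
    fmt: str                      # struct format code, e.g. ">i" for a big-endian int

    def read(self, raw: bytes):
        return struct.unpack_from(self.fmt, raw, self.offset)[0]


@dataclass
class DisplayStringGenerator:
    """Turns a located value into a technology-neutral display string."""
    template: str

    def render(self, value) -> str:
        return self.template.format(value)


# Hypothetical raw event section: two 4-byte big-endian integers.
raw_event = struct.pack(">ii", 2, 2033)

reason_locator = DataLocator(offset=4, fmt=">i")
reason_display = DisplayStringGenerator("Reason={}")
text = reason_display.render(reason_locator.read(raw_event))
```

Components that only need to show the value work with `text`, while a component that needs the raw value can call `reason_locator.read` directly.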

It should be noted that the string being generated need not be tied with any technology-specific detail, and hence can be used and understood by the technology neutral components of the analyzer 10.

On the other hand, other components (e.g., an analyzer filter manager as described below) can use the data locator 416 to refer to the technology-specific raw event data value. In this case, the analyzer component utilizes a technology helper library designed specifically for the corresponding technology to interpret the event value. Different derived classes based on BCType are designed to cover different technology data types or classes, as now described.

A first technology data class is a BCBasicType (derived from BCType). This class represents any atomic native data type. That is, the native data type cannot be broken into other native data types. For example, fundamental data types such as ‘integer’ and ‘character’ can be represented by BCBasicType objects. This class can optionally carry definitions of mapping between integer/character values and meaningful enumerator strings. Many times such integer or character constant values are represented by a human readable enumerator string (e.g.: MQCC_OK(0) in the MQSeries™ completion code definitions). The BCBasicType class contains information relating to this type of mapping.
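The optional value-to-enumerator mapping can be sketched in a few lines. The `BasicType` class below is a loose analogue written for illustration and does not reproduce the actual BCBasicType layout; the three completion code values are the standard MQSeries ones.

```python
from typing import Dict, Optional


class BasicType:
    """Loose analogue of BCBasicType: an atomic type with optional enumerator names."""

    def __init__(self, name: str, enumerators: Optional[Dict[int, str]] = None):
        self.name = name
        self.enumerators = enumerators or {}

    def display(self, value: int) -> str:
        label = self.enumerators.get(value)
        # Fall back to the bare number when no enumerator is defined for it.
        return f"{label}({value})" if label is not None else str(value)


completion_code = BasicType(
    "MQLONG", {0: "MQCC_OK", 1: "MQCC_WARNING", 2: "MQCC_FAILED"})
plain_integer = BasicType("MQLONG")
```

Here `completion_code.display(0)` yields the string `MQCC_OK(0)`, while a type without a mapping simply prints the number.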

A second technology data class is a BCCompoundOptionType (also derived from BCBasicType), which is similar to BCBasicType. This class allows mapping of multiple enumerator names to a single value.

A third technology data class is a BCEnumType (also derived from BCBasicType). This class is also similar to BCBasicType except that it is not applied to any runtime event value. Instead, it provides a static definition of enumerators. This can be useful to represent the enumerator concepts in programming languages such as C++.

A fourth technology data class is a BCCompositeType (derived from BCType). This class type serves as a container class and contains references to other BCType objects and BCMethod objects. The BCCompositeType can be used to model classes and structures in most conventional programming languages such as C, C++, Java, etc.

A fifth technology data class is a BCArrayType (derived from BCType). This type is used to model the array type in conventional programming languages. It is preferably always associated with a BCType class that refers to the data type the array type builds on top of, and it provides a mechanism for accessing a particular element in the array of runtime event data.

A sixth technology data class is a BCPointerType (derived from BCType). This type is used to model the pointer type in programming languages such as C and C++. It is preferably always associated with a BCType class that refers to the data type the pointer type is associated with.

A seventh technology data class is a BCDynamicType (derived from BCType). This type is used in situations where the layout of the data may vary according to the runtime event data. For example, and referring again to the MQSeries™ example, it is possible to have different MQSeries™ structures embedded in the user data buffer. The BCDynamicType has the capability of generating runtime children type objects to reflect the event data layout.

FIG. 5 is a logic flow diagram that illustrates the work flow of the analyzer 10 for handling a new incoming event. Operation of the analyzer 10 begins with different threads of execution. Within an individual thread, at step 510, the analyzer 10 collects events originated from one or more particular sensors 14. The event queue distribution scheme is based on the sensor 14 configuration messages. In other words, the configuration message to a particular one of the sensors 14 defines the event queue that the sensor 14 should report to.

For each event collected, at step 512 the analyzer 10 performs any necessary data conversion and processing on the received data. Data conversion includes (but is not necessarily limited to) integer and floating point encoding conversion and character code set conversion. The goal is to ensure all incoming event data is saved in one standard format.
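The two kinds of conversion named above can be illustrated with standard library calls. This is a sketch under assumptions: the helper names are hypothetical, the sender's byte order and code set are taken as already known, and EBCDIC code page 037 is used only as an example of a host code set.

```python
def to_standard_int(raw: bytes, sender_byteorder: str) -> int:
    """Decode a signed integer sent in the sender's byte order ('big' or 'little')."""
    return int.from_bytes(raw, byteorder=sender_byteorder, signed=True)


def to_standard_text(raw: bytes, sender_codec: str) -> str:
    """Decode text sent in the sender's character code set."""
    return raw.decode(sender_codec)


# The same value 42 as sent by a big-endian and a little-endian host.
from_big = to_standard_int(b"\x00\x00\x00\x2a", "big")
from_little = to_standard_int(b"\x2a\x00\x00\x00", "little")

# "MQPUT" as sent by a host that uses EBCDIC code page 037.
api_name = to_standard_text(b"\xd4\xd8\xd7\xe4\xe3", "cp037")
```

After conversion, events from different platforms compare equal where their content is equal, which is the precondition for the later key-based matching.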

At step 514 any new entity and resource entries are created accordingly, based on the extracted standard event information 318, and at step 516 the analyzer 10 proceeds to invoke the appropriate technology-specific logic to process the technology-specific event information 320. This step primarily deals with data conversions. At step 518 any new technology-specific resources are created accordingly based on the new data. At step 520 a new entry in the analyzer 10 database is created for the event information, while at step 522 event relations are generated for the newly added event (described below in relation to FIG. 6). Finally, at step 524 the appropriate data analysis tasks are performed on the newly added event data.

FIG. 6 illustrates the control flow for the above-mentioned event relation generation step 522. Before describing the various steps of this method, it should be noted that, in general, message path relations are generated for any technology. As described before, for the MQSeries™ the message path relation is primarily based on the MQPUT/MQPUT1 and MQGET relations. The underlying rationale of this process is to match any MQPUT/MQPUT1 and MQGET calls referencing the same message at the source and at the destination. Since an MQGET can be invoked in a destructive or browsing mode, it is possible that there may be more than one MQGET event for a given MQPUT/MQPUT1 event.

Several fields in the MQMD structure form what is known as the identity and origin context. This provides information on the origin of the corresponding message. This information includes the following elements:

-   UserIdentifier: identifies the user that generates the message;
-   AccountingToken: a security token associated with the message;
-   ApplIdentityData: additional user-defined data supplied with the message;
-   PutApplType: a type of application (platform information) that generates the message;
-   PutApplName: a name of the application that generates the message;
-   PutDate: the date when the message is put on a queue; and
-   PutTime: the time when the message is put on the queue.

The application that puts the message can decide whether the information is to be generated fresh by the queue manager, copied from a previous MQGET call, customized by the application itself, or is void, i.e., no origin context information is to be generated.

In the first case, i.e., the information is to be generated fresh by the queue manager, the origin context provides strong evidence whether the MQPUT/MQGET calls match. However, the same is not true for the other three cases. For example, the application may be “propagating” messages it receives to other recipients, and in this case it may decide to pass on the origin context, rather than generating a new context.

The Message and Correlation IDs provide a unique identity for individual messages. This information can be generated by the queue manager, or it can be supplied by the application. Again, in the first case, i.e., the information is to be generated fresh by the queue manager, the analyzer 10 can ensure the uniqueness of the message in the matching process. However, the same does not necessarily apply in the latter case. For example, the application may have a logical error and generate the same Message and Correlation ID for all messages.

Describing FIG. 6 now in further detail, at step 610 the analyzer 10 updates the lookup table 324 (FIG. 3) for the current event. The key for the lookup table 324 comprises the message ID (24 bytes), the correlation ID (24 bytes), and the message put time (16 bytes). At steps 612 through 622 a search is made to determine if any lookup table 324 entry already exists with this key value. If not, the method creates a new lookup table entry and exits at step 626. If a lookup table 324 entry already exists with this key value, then at step 624 the method adds the current event to the value list associated with the matching key.
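
By way of a non-limiting illustration, the keyed lookup described above may be sketched in Python as follows. The `Event` class, its field names, and the `update_lookup` helper are assumptions made only for this sketch; the specification does not prescribe a particular data layout for the lookup table 324.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical captured event; only the key fields are modeled."""
    event_id: int
    api: str          # e.g. "MQPUT" or "MQGET"
    msg_id: bytes
    correl_id: bytes
    put_time: str


def lookup_key(event):
    # Key built from message ID, correlation ID and message put time.
    return (event.msg_id, event.correl_id, event.put_time)


def update_lookup(table, event):
    """Return earlier events stored under the same key, then record this one."""
    key = lookup_key(event)
    candidates = list(table[key])   # empty when the key is new
    table[key].append(event)
    return candidates


table = defaultdict(list)
put = Event(1, "MQPUT", b"M1", b"C1", "12000000")
get = Event(2, "MQGET", b"M1", b"C1", "12000000")
first = update_lookup(table, put)    # no earlier entry for this key
second = update_lookup(table, get)   # the earlier MQPUT is a candidate
```

The returned candidates are only potential matches; they would still be confirmed field by field before an event relation is recorded.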

In more detail, at step 612 the analyzer 10 locates the lookup table entry with the same key as the current event, and retrieves the list of associated events. At step 614 the method checks for a potential matching event, i.e., a check is made to determine if there is any potential matching event generated from step 612 that has not been examined yet. If there is no further event, the process is completed (step 626). Otherwise, the method performs the following steps to confirm whether the new event actually matches the current event in an MQPUT/MQGET relation.

At step 616 a check is made to determine if the PutDate fields match, i.e., if the PutDate field in the MQMD structure for the current event and a matching candidate event match. If not, the method returns to step 614 for a next potential matching event.

If the PutDate fields match, flow continues to step 618 to determine if the PutAppl fields match, i.e., if the PutAppl field in the MQMD structure for the current event and the matching candidate event match. If not, the method returns to step 614 for a next potential matching event.

If the PutAppl fields match, flow continues to step 620 to determine if the PutType fields match, i.e., if the PutType field in the MQMD structure for the current event and the matching candidate event match. If not, the method returns to step 614 for a next potential matching event.

If the PutType fields match, flow continues to step 622 to determine if the UserIdentifier fields match, i.e., if the UserIdentifier field in the MQMD structure for the current event and the matching candidate event match. If not, the method returns to step 614 for a next potential matching event.

Assuming that the UserIdentifier fields also match, at step 624 the method confirms the matching event relation by declaring the candidate event from the lookup table 324 as a matching event to the current event, and correspondingly updates the associated event relation record. Flow then returns to step 614 to process the next potential matching event.
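
The candidate confirmation of steps 614 through 624 may be sketched as follows. This is a minimal Python sketch under stated assumptions: each event carries its MQMD fields as a dictionary, and the PutAppl and PutType checks are assumed to correspond to the MQMD fields PutApplName and PutApplType listed earlier. The function name `confirm_matches` is hypothetical.

```python
# Fields compared in order, mirroring steps 616, 618, 620 and 622.
MATCH_FIELDS = ("PutDate", "PutApplName", "PutApplType", "UserIdentifier")


def confirm_matches(current, candidates, fields=MATCH_FIELDS):
    """Return the candidates whose MQMD fields all equal those of `current`."""
    matches = []
    for candidate in candidates:                 # step 614: next candidate
        if all(current[f] == candidate[f] for f in fields):
            matches.append(candidate)            # step 624: relation confirmed
    return matches


get_event = {"PutDate": "19990315", "PutApplName": "LOANAPP",
             "PutApplType": 11, "UserIdentifier": "clerk01"}
same_put = dict(get_event)
other_put = dict(get_event, UserIdentifier="clerk02")
confirmed = confirm_matches(get_event, [same_put, other_put])
```

Because `all` stops at the first unequal field, a candidate that fails an early check is abandoned without the later comparisons, as in the flow described above.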

In other embodiments of this invention more or fewer than these particular fields may be used to establish an event match/non-match condition.

FIG. 7 illustrates a presently preferred analyzer 10 COM model interface, and more specifically shows a relationship between the analyzer 10 logic model 718 and system model 714, and a COM object wrapper layer 722.

The analyzer 10 logic model 718 provides a mechanism to represent different technology functions and data structures in a uniform manner. The resource model, part of the analyzer 10 system, provides a technique to represent the technology-specific entities. That is, the logic model 718 and the system model 714, when taken together, represent the monitored system environment and activities.

The display string generation capability (blocks 414 of FIG. 4) provided by the BCType class in the logic model 718 enables the analyzer 10 components to illustrate the event data value in a technology-neutral fashion. However, this does not in and of itself enable the human user to manipulate the event data in data analysis or other tasks.

Scripting languages such as VBScript and JScript provide a means for the programmer to create objects in compiled languages such as C and C++ that are accessible to the scripting language. VBScript uses the Microsoft COM automation interface to call into any programmer-defined objects from within a script. The Microsoft COM model is used to allow a human user to programmatically manipulate the event data. Thin “wrapper” objects based on the COM automation model are implemented on top of the logic model 718 and the system model 714. Through the COM automation interface, programs or scripts can be written to access the event data in a consistent manner. By employing the Visual Basic™ Scripting support, the human user can design a script that handles the COM wrapper objects. The scripts can be designed by the user to filter the set of events to be seen in the analyzer 10 human user interface (referred to as presentation filtering), or to perform other data analysis tasks. The scripts may also be automatically generated by entry of data into a few fields in a presentation filter dialogue box.

FIG. 7 shows the hierarchical relationship between the standard system entities and resources 710, the technology-specific system entities and resources 712, and the analyzer 10 system model 714. Also shown is the technology-specific event data 716 (320), which feeds into the logic model 718. The outputs of the system model 714, the logic model 718, and standard event data 720 (318) are all inputs to the COM object wrapper 722, which in turn provides an output to the Visual Basic™ scripting unit 724.

FIG. 8 illustrates the relationship between presentation data filtering logic and the COM object wrapper layer 722. A filter manager 810 provides a portion of a simple user interface 812 for users to search and filter on certain criteria. This user interface generates a Visual Basic™ script, which contains a set of rules corresponding to the selections made by the user. The generated script, via a script engine 814, uses the COM object wrapper 722 to access analyzer internal components such as the logic model 718, the system model 714 and the database 20 (FIG. 1) to retrieve and filter data.

There may be times when the user interface 812 is not sufficient to perform advanced searches. In that case, the user can edit the generated script, generating user-modified or user-defined script 818, and leverage the power of Visual Basic™ to provide additional rules and conditions. For example, part of the user data message captured by a particular sensor 14 may include a particular date of interest (e.g., a date that a previous loan obligation was satisfied). By knowing the number of bytes that this date is offset into the captured user message portion, the user can modify the script to specifically look for a date at this location in the event data region that meets some criterion (e.g., the date must be earlier than the current date, otherwise an error condition exists).

In any case, once the script is obtained, either from the user interface 812 or the user 818, the filter manager 810 invokes the Visual Basic™ scripting engine 814 to run the script. As the script runs, the scripting engine 814 invokes the COM objects provided by the analyzer COM model 722 to access the event data. The results of the script are placed in another COM object (shown as well as the COM model 722). The filter manager 810 accesses the results COM object and then passes the data back to a display or presentation portion 812A of the user interface, where the results of the script are displayed in, for example, a list format. Other types of scripts and scripting engines could be employed as well, and the teachings of this invention are not limited to using only Visual Basic™.
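
The offset-based date check described in this example may be sketched as follows. The sketch is written in Python rather than Visual Basic™ purely for illustration, and it assumes, hypothetically, that the date is stored as eight ASCII characters in YYYYMMDD form; the function names and the sample payload are likewise assumptions.

```python
from datetime import date, datetime


def date_at_offset(user_data: bytes, offset: int) -> date:
    """Decode an 8-character YYYYMMDD date found `offset` bytes into the data."""
    raw = user_data[offset:offset + 8].decode("ascii")
    return datetime.strptime(raw, "%Y%m%d").date()


def date_is_valid(user_data: bytes, offset: int, today: date) -> bool:
    # Criterion from the example: the date must be earlier than the current date.
    return date_at_offset(user_data, offset) < today


# Hypothetical captured user data: an 8-byte loan reference, then the date.
payload = b"LOAN0042" + b"19990315" + b"PAID"
satisfied_on = date_at_offset(payload, 8)
```

An event for which `date_is_valid` returns False would be reported as an error condition under the criterion given above.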

The following is an example of a VBScript script generated by the filter user interface. In this case, the user input was to search the collected event data for all API “MQPUTs” which had a return code (parameter 7) of “MQCC_FAILED”.

“EventsPool” is an analyzer 10 object which iterates through the event database. For each iteration, the object “esevent”, which contains event data, is created and filled in from the database. The “esevent” object contains methods and properties to access event data such as API name (“Method” property), host name (“Host” property), and other attributes. The “method” object in turn contains properties and values to get data from each parameter value. These methods and properties eventually call into the analyzer 10 logic and system models. In this example, the seventh parameter of “MQPUT” is the return code. The “If” statement checks for the value of the parameter being equal to “MQCC_FAILED”. The “UIEvents” object is a list of events, and the output back to the analyzer 10 user interface. If the condition matches, the event is added to the “UIEvents” list of events to be displayed in the analyzer 10 user interface 812.

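-   ' Completion code value that marks a failed call
-   MQCC_FAILED=2
-   For Each esevent In EventsPool
    -   Set method=esevent.Method
    -   paramvall=Null
    -   ' Read the seventh parameter (the return code) of MQPUT calls only
    -   If (esevent.Method.Name="MQPUT") Then
        -   paramvall=method.GetParamValue(7).Val
    -   End If
    -   ' Pass failed MQPUT events back to the user interface
    -   If ((paramvall=MQCC_FAILED)) Then
        -   UIEvents.Add(esevent)
    -   End If
-   Next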

The user could customize this simple script to perform more powerful conditional filtering. For example, if the user desires to search for events which have a result code of “MQCC_FAILED” or of “MQCC_WARNING”, the user could modify the script above as follows:

-   MQCC_WARNING=1
-   MQCC_FAILED=2
-   For Each esevent In EventsPool
    -   Set method=esevent.Method
    -   paramvall=Null
    -   If (esevent.Method.Name="MQPUT") Then
        -   paramvall=method.GetParamValue(7).Val
    -   End If
    -   ' Keep events whose return code is either a failure or a warning
    -   If ((paramvall=MQCC_FAILED) Or (paramvall=MQCC_WARNING)) Then
        -   UIEvents.Add(esevent)
    -   End If
-   Next

Another use of the script could be to export selected data into files or to other applications which use the COM automation interface (722, FIGS. 7 and 8), such as Microsoft Excel™.

FIGS. 9 and 10 illustrate the processes that the analyzer 10 uses to group events automatically into related transactions, either within a single thread of execution and unit of work (UOW, a local transaction) as in FIG. 9, or across multiple threads of execution, units of work, processes, and/or hosts (a global or business transaction), as in FIG. 10.

In general, given a starting event (e) of interest to the user, the transaction analysis module can locate other events that occurred within the same local or business transaction as the event of interest. The user interface 812A may then display for the user the subset of the recorded events that are within that transaction of interest. This allows the user to quickly focus on the events relevant to the problem being analyzed.

A local transaction includes the operations (e.g., API calls such as MQPUT, MQGET and MQCMIT (commit)) that are performed during the time span of a single unit of work (UOW). Operations performed within one unit of work are either committed or are backed out together, so that the effects of these many operations all are either made permanent (committed) or reversed (backed out) as one atomic group. This is a common feature of many transaction oriented technologies, including databases and middleware.

A global or business transaction includes the operations done within one or more related local transactions. When communication occurs between the threads of execution of different units of work, these units of work are considered part of the same business or global transaction. For example, when a client process sends a message to a server process, it will do so in the context of a local transaction, and the server receiving the message will similarly do so within a second local transaction. The operations performed within these two local transactions, both the communication operations that allow the two processes to exchange data as well as any other computational operations within these local transactions, are thus part of the same business transaction.

Referring first to FIG. 9, at step 910 the user specifies an event (e) of interest, and at step 912 the analyzer locates the event of interest in the time-sorted set of database 20 events, S, for event e's thread of execution. The resulting position in S is denoted as P. At step 914 the event at the current position in S is added to a set of events for the transaction. A test is then made at step 916 to determine if this event began the unit of work. If it did not, control passes to step 918 to find a previous event in S, and a determination is made at step 920 if a previous event exists in S. If there is no previous event, control passes to step 922 to set the current position in S back to P. Step 922 is executed as well if the determination at step 916 is yes; otherwise, if a previous event is found to exist at step 920, control passes to step 924. At step 924 a determination is made if the previous event is in the same unit of work. If no, control passes to step 922; otherwise, if yes, control passes back to step 914 where the event at the current position in S is added to the set of events for the transaction, and the method then continues the search for the first event in the unit of work.

Eventually the method will terminate the backwards (in time) search of S and will execute step 922, after which control passes to step 926 where a forward search through S is initiated. At step 926 a search is made for the next event in S. If a next event does not exist (step 928) control passes to step 930 to terminate the method, and the events from the transaction of interest have been determined. If a next event in S is found to exist at step 928 control passes to step 932 to determine if this next event is in the same unit of work. If no, control passes back to step 926 to find the next event in S; otherwise, if yes, control passes to step 934 to add the event at the current position in S to the set of events for this transaction.

At step 936 a test is made to determine if this event ends the unit of work (e.g., was the captured API call an MQCMIT for this UOW?). If no, control passes back to step 926 to continue the forward search through S, adding associated events to the transaction until the event that ends the UOW is located. Finally, at step 936 the event that ends the UOW is identified, and control passes to step 930 to terminate the method. At this time the list of events that make up the UOW can be displayed to the user for analysis.
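
The backward and forward searches of FIG. 9 may be sketched in Python as follows. This is an illustrative sketch only: events are modeled as dictionaries, and the keys `uow`, `begins_uow` and `ends_uow` are hypothetical stand-ins for whatever the analyzer 10 actually records about each captured call.

```python
def local_transaction(S, p):
    """Collect the events of the unit of work containing S[p].

    S is the time-sorted event list for one thread of execution and p is
    the position of the event of interest.
    """
    uow = S[p]["uow"]
    members = [p]

    # Backward search (steps 914-924): stop at the event that began the
    # UOW, at the start of S, or when the previous event is in another UOW.
    i = p
    while not S[i]["begins_uow"] and i > 0 and S[i - 1]["uow"] == uow:
        i -= 1
        members.append(i)

    # Forward search from P (steps 922-936): skip events of other UOWs and
    # stop after the event that ends this UOW, or at the end of S.
    for j in range(p + 1, len(S)):
        if S[j]["uow"] != uow:
            continue
        members.append(j)
        if S[j]["ends_uow"]:
            break

    return [S[k] for k in sorted(members)]


S = [
    {"id": 1, "api": "MQCMIT", "uow": "A", "begins_uow": False, "ends_uow": True},
    {"id": 2, "api": "MQGET", "uow": "B", "begins_uow": True, "ends_uow": False},
    {"id": 3, "api": "MQPUT", "uow": "B", "begins_uow": False, "ends_uow": False},
    {"id": 4, "api": "MQCMIT", "uow": "B", "begins_uow": False, "ends_uow": True},
    {"id": 5, "api": "MQGET", "uow": "C", "begins_uow": True, "ends_uow": False},
]
uow_b = [e["id"] for e in local_transaction(S, 2)]
```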

FIG. 10 depicts the operation of the analyzer transaction correlation function at a higher (business transaction) level that can transcend multiple threads and hosts. At step 1010 the method starts by the user specifying an event of interest, and at step 1012 an empty (null) list of related events is created. At step 1014 the event of interest is added to the list of related events, thereby providing one entry. At step 1016 a recursion is initiated, where the list is checked to determine if it contains an entry. Since an event was just placed in the list, the yes path is taken to step 1018 to remove the event (e), and a check is made at step 1020 to determine if the event (e) has already been added to a set of transaction events. Assuming at this point that it has not, control passes to step 1022, to find all events in the same local transaction, such as the same UOW, as event (e), including event (e). In this case the method shown in FIG. 9 is executed, as described above.

At the completion of the execution of the method of FIG. 9, control passes to step 1024 to add each of the determined events (i.e., those in step 930 of FIG. 9 corresponding to a UOW) to the set of transaction events. Control then passes to step 1026 where, for each of the events from step 1024, all other events that share the same message path event relationship with these events are located, and added to the list of related events. Control then reverts to step 1016. After one of the events is removed from the list, and if it has already been added to the set of transaction events, then control passes back to step 1016 to remove the next event; otherwise control passes to step 1022 to execute again the method of FIG. 9. Eventually, all events in the business transaction will have been found, and the method will terminate at step 1028. What results is a set of connected or correlated events for a transaction that are collected across all processes. These transaction events can then be displayed to a user in a common format for review and analysis, which is a desired result of the teachings found herein.
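
The work-list procedure of FIG. 10 may be sketched as follows. In this hedged Python sketch, `local_txn` stands in for the method of FIG. 9 and `message_partners` stands in for the message path event relation lookup; both callables, and the small integer event identifiers used in the demonstration, are assumptions made for illustration.

```python
def business_transaction(start, local_txn, message_partners):
    """Return the set of events reachable from `start` through shared
    units of work and message path relations."""
    related = [start]                # steps 1012-1014: list of related events
    transaction = set()              # set of transaction events
    while related:                   # step 1016: any entry left?
        e = related.pop()            # step 1018: remove an event
        if e in transaction:         # step 1020: already handled
            continue
        group = local_txn(e)         # step 1022: same local transaction
        transaction.update(group)    # step 1024
        for g in group:              # step 1026: follow message paths
            related.extend(message_partners(g))
    return transaction               # step 1028


# Hypothetical data: events 1-2 share UOW "A", 3-4 share "B", 5 is alone.
uow_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
paths = {2: [3], 3: [2]}             # event 2 put the message event 3 got


def same_uow(e):
    return [x for x in uow_of if uow_of[x] == uow_of[e]]


def partners(e):
    return paths.get(e, [])


found = business_transaction(1, same_uow, partners)
```

Event 5 is never reached because no message path relation links its unit of work to the others.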

As was described above, the analyzer 10 makes use of the COM object model 722 and a Visual Basic™ scripting engine 814 to allow a human user to interact with the internal data model and runtime event data.

FIG. 13 is a block diagram of a distributed enterprise middleware-based system 1300 that includes the analyzer 10 and related components in accordance with the teachings described above. The system 1300 is assumed to be, for this example, a system that receives data representing mortgage applications from on-line users or customers 1310 via a global data communications network such as the internet 1320. One or more client machines 1330 receive the mortgage applications from the internet 1320 and provide them to an application (mortgage request processing) server 1340. The server 1340 parses various data fields of the mortgage requests and sends messages to various distributed applications running on a plurality of hardware/software platforms or processors so as to process the mortgage requests. For example, these applications can include a credit check application 1350, a tax assessment application 1360, a verify income application 1370, a title search application 1380 and an appraisal application 1390. The various applications could all be localized in one facility, or they could be distributed over a large geographical area. One or more of the applications (e.g., the credit check application), may be associated with another business entity altogether, who may or may not employ the teachings of this invention. In this case, a sensor 14 may not be installed on the associated application. However, the input and output message queues to and from this processing entity/application can be monitored to obtain some knowledge as to the operation thereof.

It should be noted that some of these applications may require human intervention. For example, the appraisal application will typically require that an appraiser actually examine the property for which the mortgage is being sought. As such, the various applications can differ widely in their response times (e.g., seconds to days or even weeks).

The various applications in turn output their respective results to a mortgage request evaluation application 1395, which in turn eventually provides a response back to the client machine(s) 1330, such as ‘approved’, ‘disapproved’, ‘conditionally approved’, etc.

The various functional elements shown in FIG. 13 can be executed on a plurality of diverse operating platforms using a plurality of different types of operating systems, data formats, internal data representations, etc. As can be appreciated, if erroneous results are obtained, it is important to determine the source of the problem so that the problem can be corrected. However, this task is complicated by the fact that some thousands of different mortgage requests may be in process at any given time, in various stages of completion.

A message-oriented middleware system, such as the above-mentioned MQSeries™, operates over the various processors and components of the system 1300, and provides message queues (Q). Messaging is preferably employed to send data between processors (instead of calling each other directly), and the queues facilitate the messaging function by temporarily storing the messages so that the various programs and applications can run independently and asynchronously relative to one another. Although not shown in FIG. 13, it is typically the case, but not required, that a queue manager will be resident on each of the processors to manage and control the storage and retrieval of messages in the queue(s).

In accordance with the teachings of this invention a plurality of the sensors 14 are operated with the various applications to selectively capture event data based on the configuration data and commands sent from the analyzer 10. The captured event data flows back to the analyzer 10 from the sensors 14, and is analyzed as described above to isolate and track the flow of one or more transactions. In this manner the operator can determine, for example, if an application generated a proper message and/or if another application actually received the message, the underlying reason when a failure code is reported, whether a particular message was properly formatted, whether a receiving application generated a reply to a particular message and, relatedly, if the sending application actually received the reply, the timing associated with message processing, and whether a particular message generated at one level or tier of a hierarchical system actually propagated to other level(s) as intended.

Through the user interface 12 the operator is enabled to formulate, via the scripting capabilities, desired transaction views and event selections, and to sort the collected event data by, for example, time, call type, queue, queue manager, host, process thread and other criteria. By selecting events in one or more of the presented views of the event data, the operator is enabled to then “drill down” into more of the details of the captured event, such as the message descriptor and the user data. That is, instead of simply presenting streams of numbers and return codes (see FIG. 15), the analyzer 10 presents the transaction event information in a human readable and comprehensible format.

Further in this regard, and referring to FIG. 14, the analyzer console 12 is the primary point of interface for diagnosing problems in the applications. The analyzer console 12 receives event messages from the sensors 14, stores the event messages in the transaction database 20, and operates on the stored event data with a data analysis module 19C, as described above. The analyzer console 12 also includes other logical and functional blocks, including a sensor filter configuration management block 19A, a sensor data collection management block 19B, a graphical presentation logic block 19D, and a communications block 19E.

The graphical presentation logic block 19D cooperates with the other components of the analyzer to provide a plurality of views of the captured event data. One view is referred to as a component layout view which graphically displays the components of the overall distributed system being monitored, including the message queues (Q) being used, hosts and processes involved, and which process (application) is in communication with which queue (Q). The links between queues and the processes are preferably displayed using lines or arcs, where a thickness (or color or some other visual characteristic) is employed to indicate an amount of message traffic passing through the process/queue link. The resulting view may resemble FIG. 13, with the links between applications and queues (Q) being annotated or otherwise visually indicating an amount of message traffic.
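
One way such a traffic-weighted link could be derived from captured events is sketched below. The event dictionary keys (`process`, `queue`, `api`), the sample queue names, and the linear scaling of line width are all assumptions of this Python sketch and are not taken from the specification.

```python
from collections import Counter

MESSAGE_CALLS = ("MQPUT", "MQPUT1", "MQGET")


def link_traffic(events):
    """Count message-carrying calls per (process, queue) link."""
    return Counter(
        (e["process"], e["queue"]) for e in events if e["api"] in MESSAGE_CALLS
    )


def line_width(count, max_count, max_width=8):
    """Scale a link's traffic count to a drawing width of at least 1."""
    if max_count == 0:
        return 1
    return max(1, round(max_width * count / max_count))


events = [
    {"process": "server", "queue": "CREDIT.Q", "api": "MQPUT"},
    {"process": "server", "queue": "CREDIT.Q", "api": "MQPUT"},
    {"process": "server", "queue": "TITLE.Q", "api": "MQPUT"},
    {"process": "server", "queue": "TITLE.Q", "api": "MQCMIT"},
]
traffic = link_traffic(events)
busiest = max(traffic.values())
```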

Another view is referred to as dynamic transaction visualization (FIG. 16 presents one example), where transactions are shown as they happen or have happened, across multiple hosts, operating systems and applications. Presentation filters can be employed to reduce the display to only the events that are applicable to a particular transaction, thus allowing rapid analysis of transaction problems. Note that in FIG. 16, in addition to the various hosts and applications shown in FIG. 13, an Asset Verification application 1355 has been added as well.

Another view is referred to as an event history, where the operator is enabled to view all captured events at a level of detail specified by the operator. These details can include, but are not limited to, the message queue that the event was placed in, the originating application and host, and the return code from a call in a human readable format (as opposed to a number). The event data can also be sorted by any of these fields so that the events can be viewed in chronological order, from a particular process or host, or by any of a plurality of event-viewing columns.

Referring to FIG. 17, the event data can also be viewed in what is referred to as an event details mode. By specifying a particular event, the operator is enabled to view even more detail than is present in the event history view. The event details can include, by example, all of the information in the message header, a “dead letter” queue header, and also user data in the message. Also, return codes can be displayed so that they are readable, e.g., MQRC_SYNCPOINT_LIMIT_REACHED, as opposed to simply the return code “2024”. Also, the analyzer 10 may provide hypertext links to the middleware documentation, so that by clicking on a particular return code the operator is enabled to obtain more specific information directly from the provider of the middleware.
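
The translation of numeric return codes into readable names may be sketched as a simple table lookup. The Python table below lists only a handful of MQSeries™ reason codes for illustration and is not exhaustive; the `readable_reason` helper is a hypothetical name.

```python
# A small, non-exhaustive table of MQSeries reason codes.
REASON_NAMES = {
    0: "MQRC_NONE",
    2009: "MQRC_CONNECTION_BROKEN",
    2024: "MQRC_SYNCPOINT_LIMIT_REACHED",
    2033: "MQRC_NO_MSG_AVAILABLE",
    2085: "MQRC_UNKNOWN_OBJECT_NAME",
}


def readable_reason(code):
    """Return the symbolic name for a reason code, or a fallback string."""
    return REASON_NAMES.get(code, f"unknown reason code {code}")


label = readable_reason(2024)
```

A fuller implementation would presumably draw the table from the middleware's own header files or documentation rather than from a hand-written dictionary.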

As an aid in identifying problems, certain error conditions may be color-coded to make them visually distinct. For example, an invalid return code from an MQI call can be displayed in red so that the operator can quickly see that a particular MQI call is failing. The same could be performed for an MQCONN call, enabling the operator to see connections to a message queue that are failing.

The above-described views provide a significant advantage over the conventional techniques for debugging and analyzing problems that arise in a distributed middleware-based system. For example, FIG. 15 shows an exemplary content of a log file used to record message traffic after a tracing facility is enabled in the MQSeries™ system. In FIG. 15 the data is actually truncated, as normally the complete function names and return codes are present. Also, the return codes are given as values, not as literals. It should be apparent that attempting to trace a given transaction across multiple hosts and operating systems is not a simple task, as a number of such records may need to be printed, and the various API calls and data then visually matched.

The analyzer 10, in accordance with the teachings herein, simplifies and automates this error analysis and transaction trace processing, and can provide the operator with messages and other data relating to a single transaction of interest, obtained from the suitably configured sensors 14 that are strategically located through the distributed data processing system.

The analyzer 10, in addition to capturing message event data in real time, can be used with pre-recorded data.

While some conventional management and monitoring tools are known for use with middleware systems, such as the MQSeries™, these conventional tools typically focus on system data, such as queue status. In accordance with the foregoing teachings, it can be appreciated that the analyzer 10 instead provides logical diagnosis information to the operator (such as API calls, call arguments, return values, etc.). Furthermore, the analyzer 10 correlates API calls made from different components of the distributed system to form a complete transactional view, including a graphical depiction of the distributed system (similar to, for example, FIG. 13).

While described primarily in the context of the MQSeries™ middleware system, the teachings of this invention have application to a number of types of systems and technologies including, but not limited to, those known as CGI/HTTP, ISAPI, NSAPI, CORBA and COM/DCOM. The teachings of this invention are thus not limited for use with only those technologies that are based on a message passing architecture.

Also, while described above primarily in the context of a development tool, it should be realized that the analyzer 10 can be used as well in a production monitoring capacity. That is, once a particular business application (such as the exemplary mortgage processing application shown in FIG. 13) has been developed and deployed, the analyzer 10 can be used to identify and diagnose problems as they occur in the production environment.

Based on the foregoing it can be appreciated that the teachings herein enable providing each stored event in the event database with a unique ID, thereby facilitating the rapid retrieval of a specific event from the event database.

Furthermore, by using the record address as the event ID, the data manager is enabled to provide various cursors to access events according to various criteria, without requiring that the database be locked up during cursor manipulation. The event cursor enables the operator to enumerate through events one at a time, based on certain conditions, without having to read all events into memory.
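
The one-at-a-time enumeration provided by an event cursor may be sketched with a Python generator. The in-memory `store` keyed by record address, the `read_record` function, and the `comp_code` field are hypothetical stand-ins for the data manager and the event database 20.

```python
def event_cursor(read_record, event_ids, predicate):
    """Yield matching events one at a time; no record is read ahead."""
    for event_id in event_ids:
        event = read_record(event_id)
        if predicate(event):
            yield event


# Hypothetical store keyed by record address, which doubles as the event ID.
store = {
    0: {"api": "MQPUT", "comp_code": 0},
    64: {"api": "MQGET", "comp_code": 2},
    128: {"api": "MQPUT", "comp_code": 2},
}
reads = []


def read_record(address):
    reads.append(address)            # track which records were fetched
    return store[address]


cursor = event_cursor(read_record, sorted(store), lambda e: e["comp_code"] == 2)
first_failed = next(cursor)
```

After the first `next` call only the records up to the first match have been read, which illustrates why the whole event set need not be held in memory.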

Furthermore, the analyzer 10 provides event relationship lookup records to assist the transaction analysis algorithm. The lookup record provides high-performance, fast access to a list of events with the same attribute value. Without this persistent nature of the lookup records in the event database 20, a runtime transaction analysis for some tens or hundreds of thousands of events would become impractical.

Still further in accordance with the foregoing teachings, the analyzer 10 provides a technique to match entry and exit events by saving the entry and the exit for one API call as one event in the event database 20. In order to accomplish this the analyzer data manager provides a unique ID value for entry and exit events for the same API call so that the event matching algorithm need search only one field, and furthermore preferably constructs a most-recently-stored (MRS) events list in memory so that the performance of the matching process is dramatically improved.
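
A possible shape for the entry/exit matching with an in-memory most-recently-stored list is sketched below. The class name, the bounded capacity, and the eviction of the oldest pending entry are assumptions of this Python sketch; the specification states only that a unique ID is shared by the entry and exit of one API call and that an MRS list is kept in memory.

```python
from collections import OrderedDict


class EntryExitMatcher:
    """Pair entry and exit records of one API call by their shared call ID."""

    def __init__(self, capacity=1024):
        self.pending = OrderedDict()     # most recently stored entries
        self.capacity = capacity

    def on_entry(self, call_id, data):
        self.pending[call_id] = data
        if len(self.pending) > self.capacity:
            self.pending.popitem(last=False)   # drop the oldest entry

    def on_exit(self, call_id, data):
        # Only the call ID is searched; a missing entry yields None.
        entry = self.pending.pop(call_id, None)
        return {"call_id": call_id, "entry": entry, "exit": data}


matcher = EntryExitMatcher(capacity=2)
matcher.on_entry(101, {"api": "MQPUT"})
matcher.on_entry(102, {"api": "MQGET"})
merged = matcher.on_exit(101, {"comp_code": 0})
```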

The analyzer 10 database is preferably designed to be technology neutral, which means that the database 20 and related code can be expanded to support different technologies with little or no changes. In order to achieve the capability of being technology neutral, the records in the database 20 for technology-specific resources preferably contain at least a type and a name, and may have as many attribute records as children as needed. In addition, a resource record can be made recursive to satisfy the case of events associated with layered resources. The database 20 and its data manager preferably work with the above-mentioned technology-specific module, for example a technology helper library which is loaded dynamically according to need in order to interpret the technology-specific contents of the event database.

Thus, while the invention has been particularly shown and described with respect to preferred embodiments thereof, it will be understood by those skilled in the art that changes in form and details may be made therein without departing from the scope and spirit of the invention.
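
A technology-neutral resource record of the kind described may be sketched as follows. The `Resource` class, its `parent` link used to model layered resources, and the sample queue manager and queue names are assumptions of this Python sketch rather than the actual record layout of the database 20.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    """Generic resource record: a type, a name, and free-form attributes."""
    type: str
    name: str
    attributes: dict = field(default_factory=dict)
    parent: Optional["Resource"] = None      # recursion for layered resources

    def path(self):
        """Return (type, name) pairs from the outermost resource inward."""
        node, chain = self, []
        while node is not None:
            chain.append((node.type, node.name))
            node = node.parent
        return list(reversed(chain))


queue_manager = Resource("QueueManager", "QM1", {"host": "hostA"})
queue = Resource("Queue", "CREDIT.Q", {"max_depth": 5000}, parent=queue_manager)
layers = queue.path()
```

Because nothing in the record names a particular middleware, the same structure could hold, for example, an HTTP server and one of its URLs.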

1. A computer-implemented method for monitoring an operation of atransaction processing system, comprising steps of: intercepting a firstApplication Program Interface (API) call; examining said first API call,and if said first API call meets predetermined API call criteria,storing all or a portion of a content of said first API call as a firststored event; intercepting a second API call; examining said second APIcall, and if said second API call meets said predetermined API callcriteria, storing all or a portion of a content of said second API callas a second stored event; determining that said first API call is a partof a same particular business transaction as said second API call if:(a) said first stored event indicates said first API call sent amessage, and said second stored event indicates said second API callreceived said message, or (b) said first and second stored eventsindicate said first and second API calls were conducted in a sametransactional unit of work; and if said first API call is a part of saidsame particular business transaction as said second API call, employingall or a portion of said first and second stored events in a subsequentprocess.
 2. A method as in claim 1, wherein the API call criteria comprises system entity identity.
 3. A method as in claim 1, wherein the API call criteria comprises API name.
 4. A method as in claim 1, wherein the API call criteria comprises timing data.
 5. A method as in claim 1, wherein the API call criteria comprises restrictions on parameter values to the API call.
 6. A method as in claim 1, wherein the step of intercepting said first API call includes a step of operating a sensor that is automatically enabled for responding to an occurrence of an error condition or a change in program states or environments.
 7. A method as in claim 1, wherein the step of intercepting said first API call includes a step of operating a sensor that is automatically enabled upon an occurrence of at least one pre-programmed triggering event, the sensor thereafter capturing all event data that satisfies a specific data collection filter.
 8. A method as in claim 1, wherein the step of employing includes a step of processing the first and second stored events using a script.
 9. A method as in claim 8, wherein the script is a pre-programmed script.
 10. A method as in claim 8, wherein the script is automatically generated by entry of data to a plurality of fields on a presentation filter dialogue box.
 11. A method as in claim 8, wherein the script is an operator-defined script.
 12. A method as in claim 1, wherein the step of intercepting said first API call comprises steps of: installing a sensor between an output of an application and a function call library for emulating, relative to the application, an interface to the function call library; and storing the predetermined API call criteria in a memory that is accessible by said sensor, and wherein the step of examining said first API call comprises: determining if the first API call fulfills the predetermined API call criteria; and if a match occurs, capturing data representing all or a portion of the content of the first API call and transmitting the captured data to a database for storage as said first stored event.
 13. A method as in claim 1, wherein the step of intercepting said first API call comprises steps of: installing a sensor between an output of an application and a function call library for emulating, relative to the application, an interface to the function call library; and programming the predetermined API call criteria into a memory that is accessible by said sensor.
 14. A method as in claim 1, wherein the step of determining includes a step of correlating an entry event with an exit event.
 15. A method as in claim 1, wherein the step of determining includes a step of correlating a message queue put event with a message queue get event.
 16. A method as in claim 1, wherein the step of storing all or a portion of said content of said first API call includes a step of storing said first stored event in an event database with a unique ID, and wherein the step of determining includes a step of locating said first stored event in the event database using the unique ID.
 17. A method as in claim 1, wherein the step of storing all or a portion of said content of said first API call stores an entry and an exit for said first API call as one event in an event database and provides a unique ID for said first stored event, such that a matching algorithm need search only one field.
 18. A method as in claim 17, wherein the step of storing all or a portion of said content of said first API call further constructs a most-recently-stored (MRS) events list for improving the performance of the matching algorithm.
 19. A method as in claim 1, wherein the step of storing all or a portion of said content of said first API call employs a record address as a stored event ID, and wherein the step of determining provides cursors to access events according to various criteria, including an ability to enumerate through events one at a time, without requiring that all events be read into memory.
 20. A method as in claim 1, wherein the step of storing all or a portion of said content of said first API call provides event relationship lookup records to assist the performance of the step of determining by providing fast access to a list of events with a same attribute value.
 21. An analyzer system for monitoring operation of a transaction processing system, comprising: a first programmable sensor for intercepting and examining a first Application Program Interface (API) call to determine if said first API call meets programmed API call criteria, said first sensor being responsive to a condition that if said first API call meets the programmed API call criteria, capturing all or a portion of a content of said first API call; a second programmable sensor for intercepting and examining a second API call to determine if said second API call meets the programmed API call criteria, said second sensor being responsive to a condition that if said second API call meets the programmed API call criteria, capturing all or a portion of a content of said second API call; an analyzer console bidirectionally coupled to said first and second sensors, said analyzer console comprising: (i) an event database having inputs coupled to said first and second sensors for storing said captured content of said first and second API calls as first and second stored events, respectively; (ii) a data manager coupled to said event database for determining that said first API call is a part of a same particular business transaction as said second API call if: (a) said first stored event indicates said first API call sent a message, and said second stored event indicates said second API call received said message, or (b) said first and second stored events indicate said first and second API calls were conducted in a same transactional unit of work; and (iii) a user interface for displaying all or a portion of said first and second stored events, if said first API call is a part of said same particular business transaction as said second API call; and a module that employs all or a portion of said first and second stored events in a subsequent process.
 22. An analyzer system as in claim 21, wherein said API call criteria comprises at least one of a system entity identity, an API name, timing data, and restrictions on parameter values to the API call.
 23. An analyzer system as in claim 21, wherein said first sensor is automatically enabled for responding to an occurrence of an error condition or a change in program states or environments.
 24. An analyzer system as in claim 21, wherein said first sensor is automatically enabled upon an occurrence of at least one programmed triggering event, and wherein the first sensor thereafter captures all event data that satisfies a specific data collection filter.
 25. An analyzer system as in claim 21, wherein said data manager processes said first and second stored events using one of a pre-programmed script or an operator-defined script, and wherein the script can be automatically generated by entry of data to a plurality of fields on a presentation filter dialogue box.
 26. An analyzer system as in claim 21, wherein said first sensor is installed between an output of an application and a function call library for emulating, relative to the application, an interface to the function call library, and wherein predetermined API call criteria are programmed into a memory that is accessible by said first sensor.
 27. An analyzer system as in claim 21, wherein if said first sensor determines that said first API call fulfills the programmed API call criteria, said first sensor transmits said captured content of said first API call to said event database for storage as said first stored event.
 28. An analyzer system as in claim 21, wherein said data manager operates on said first and second stored events to at least one of correlate an entry event with an exit event, and correlate a message queue put event with a message queue get event.
 29. An analyzer system as in claim 21, wherein said first stored event in said event database is provided a unique ID for locating said first stored event in the event database.
 30. An analyzer system as in claim 21, wherein said first stored event in said event database comprises an entry and an exit for said first API call and is identified with a unique ID, such that an event matching algorithm need search only one data field.
 31. An analyzer system as in claim 21, wherein said first stored event in said event database comprises technology neutral event information, and wherein said second stored event in said event database comprises technology specific event information.
 32. An analyzer system as in claim 31, further comprising a technology-specific module that is invoked as needed to interpret said technology specific event information.
 33. An analyzer system for monitoring operation of a data processing system, comprising: a sensor for (a) intercepting a first Application Program Interface (API) call, (b) examining said first API call, and (c) capturing first data from said first API call, if said first API call meets a criterion; and an analyzer console including (a) an event database having an input coupled to said sensor for storing said data, (b) a data manager coupled to said event database for determining that said first API call is logically related to a second API call if: (i) said first data indicates said first API call sent a message, and second data from said second API call indicates said second API call received said message, or (ii) said first data indicates said first API call was conducted in a transactional unit of work, and said second data indicates said second API call was also conducted in said transactional unit of work, and (c) a module that employs said first data and said second data in a subsequent process, if said first API call is logically related to said second API call.
 34. The analyzer system of claim 33, wherein said data processing system is a distributed data processing system that includes a first processor and a second processor, wherein said first API call is invoked by a first process running on said first processor, and wherein said second API call is invoked by a second process running on said second processor.
 35. The analyzer system of claim 33, wherein said subsequent process displays said data.
 36. The analyzer system of claim 33, wherein said criterion is programmable, and wherein said analyzer console provides configuration data to said sensor for programming said criterion.
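
For illustration only, and without limiting the claims, the interception and correlation steps recited above might be sketched as follows. The sketch assumes an in-memory list in place of the event database and a predicate function in place of the programmed API call criteria. The names StoredEvent, Sensor, same_transaction, queue_put, queue_get and queue_inquire, together with the parameter keys action, message_id and unit_of_work, are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StoredEvent:
    """Hypothetical stored event holding a portion of an API call's content."""
    event_id: int
    api_name: str
    action: Optional[str] = None        # "send" or "receive" for messaging calls
    message_id: Optional[str] = None
    unit_of_work: Optional[str] = None


class Sensor:
    """Examines intercepted API calls and stores those that meet the criteria."""

    def __init__(self, criteria: Callable[[str, Dict], bool],
                 database: List[StoredEvent]) -> None:
        self.criteria = criteria
        self.database = database

    def intercept(self, api_name: str, params: Dict) -> Optional[StoredEvent]:
        # Calls that do not meet the criteria are not stored.
        if not self.criteria(api_name, params):
            return None
        event = StoredEvent(
            event_id=len(self.database) + 1,   # assumed unique-ID scheme
            api_name=api_name,
            action=params.get("action"),
            message_id=params.get("message_id"),
            unit_of_work=params.get("unit_of_work"),
        )
        self.database.append(event)
        return event


def same_transaction(first: StoredEvent, second: StoredEvent) -> bool:
    """True if (a) the first call sent a message the second received, or
    (b) both calls ran in the same transactional unit of work."""
    sent_and_received = (
        first.action == "send"
        and second.action == "receive"
        and first.message_id is not None
        and first.message_id == second.message_id
    )
    same_unit = (
        first.unit_of_work is not None
        and first.unit_of_work == second.unit_of_work
    )
    return sent_and_received or same_unit


db: List[StoredEvent] = []
sensor = Sensor(lambda name, params: name in {"queue_put", "queue_get"}, db)
put = sensor.intercept("queue_put", {"action": "send", "message_id": "m-1"})
get = sensor.intercept("queue_get", {"action": "receive", "message_id": "m-1"})
ignored = sensor.intercept("queue_inquire", {})
print(same_transaction(put, get))
```

In this sketch the criteria are supplied to the sensor from outside, in the manner of configuration data from an analyzer console, and the correlation test reads only stored events, so it could be applied to events captured on different processors.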